A. Sample size of the data.

First load the data, then get the sample size with the nrow() function.

wine <- read.csv("data/winequality-red.csv", sep = ";")
sample_size <- nrow(wine)
print(paste('sample size is ', sample_size))
## [1] "sample size is  1599"

B. Identify outliers

Draw a plot for each of the variables:

for (col_name in colnames(wine))
  plot(wine[[col_name]], main = paste("distribution of ", col_name))

The plots suggest some outliers in the total.sulfur.dioxide variable.
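The visual impression can be cross-checked numerically. The following is a minimal sketch, assuming the common 1.5 x IQR rule, which is not part of the analysis above. The helper iqr_outliers() is hypothetical and the vector toy is synthetic illustration data, not taken from the wine data.

```r
# Hypothetical helper: return the values that fall outside
# [Q1 - k * IQR, Q3 + k * IQR], with k = 1.5 by default.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Synthetic values for illustration only.
toy <- c(22, 30, 38, 41, 47, 55, 62, 289)
print(iqr_outliers(toy))
```

With the data loaded, lapply(wine, iqr_outliers) would list the flagged values for every column.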

C. Summary of the data.

The summary() function reports the minimum, first quartile, median, mean, third quartile, and maximum of each variable:

summary(wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

I also include the standard deviation of each variable to give more insight into its spread:

for (col_name in colnames(wine)) {
  # Use a separate name so the sd() function is not shadowed
  col_sd <- sd(wine[[col_name]])
  print(paste("standard deviation of ", col_name, ": ", round(col_sd, 2)))
}
## [1] "standard deviation of  fixed.acidity :  1.74"
## [1] "standard deviation of  volatile.acidity :  0.18"
## [1] "standard deviation of  citric.acid :  0.19"
## [1] "standard deviation of  residual.sugar :  1.41"
## [1] "standard deviation of  chlorides :  0.05"
## [1] "standard deviation of  free.sulfur.dioxide :  10.46"
## [1] "standard deviation of  total.sulfur.dioxide :  32.9"
## [1] "standard deviation of  density :  0"
## [1] "standard deviation of  pH :  0.15"
## [1] "standard deviation of  sulphates :  0.17"
## [1] "standard deviation of  alcohol :  1.07"
## [1] "standard deviation of  quality :  0.81"
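In the output above, density shows a standard deviation of 0 only because round() to two decimal places drops everything past the second decimal; the range in the summary (0.9901 to 1.0037) shows that the variable does vary. One possible alternative, sketched here on a synthetic data frame (toy and its values are made up for illustration), uses sapply() with signif() to keep significant digits instead of decimal places.

```r
# Synthetic stand-in for the wine data; values are illustrative only.
toy <- data.frame(
  density = c(0.9950, 0.9960, 0.9970, 0.9980),
  alcohol = c(9.0, 10.0, 11.0, 12.0)
)

# Standard deviation of every column, kept to 3 significant digits,
# so small-scale variables are not rounded down to zero.
col_sds <- signif(sapply(toy, sd), 3)
print(col_sds)
```

The same call on the real data would be signif(sapply(wine, sd), 3).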

D. Visualize the distribution of each variable.

Draw a histogram of each variable with the hist() function and overlay a density curve.

for (col_name in colnames(wine)) {
  hist(wine[[col_name]], main = col_name, freq = FALSE)
  lines(density(wine[[col_name]]), lwd = 5, col = "blue")
}

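E. Are any of the distributions in D skewed?

Yes. Several variables, such as citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, and alcohol, are skewed. In each of these the mean exceeds the median in the summary in C, which is consistent with a longer right tail.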
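Skewness can also be quantified rather than judged by eye. The following is a minimal sketch, assuming the moment-based sample skewness (third central moment divided by the 1.5 power of the second); the helper skewness() is hypothetical, base R has no built-in function of that name, and the example vectors are synthetic.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Synthetic examples for illustration only.
right_skewed <- c(1, 1, 2, 2, 3, 10)
symmetric <- c(1, 2, 3, 4, 5)
print(skewness(right_skewed))
print(skewness(symmetric))
```

With the data loaded, sapply(wine, skewness) would give one value per variable.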

F. What data mining methods are used in this paper?

The paper discusses multiple regression (MR), neural networks (NN), and support vector machines (SVM). MR can be seen as a special case of an NN with no hidden layer. The empirical results show that SVM outperformed NN (and also MR) in this case study, especially for white wine.
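The link between MR and an NN without a hidden layer can be illustrated with a small experiment. This is only a sketch under stated assumptions, not the paper's implementation: a single linear output unit is trained by plain gradient descent on squared error, using synthetic data, and its weights are compared with the least-squares fit from lm(). The learning rate and iteration count are arbitrary choices.

```r
set.seed(1)

# Synthetic data for illustration only.
x <- rnorm(200)
y <- 2 + 3 * x + rnorm(200, sd = 0.5)

# "Network" with no hidden layer: one linear unit, y_hat = b + w * x.
w <- 0
b <- 0
lr <- 0.1  # assumed learning rate
for (i in 1:2000) {
  err <- (b + w * x) - y
  b <- b - lr * mean(err)      # gradient of half the mean squared error w.r.t. b
  w <- w - lr * mean(err * x)  # gradient of half the mean squared error w.r.t. w
}

# Ordinary least squares fit for comparison.
ols <- coef(lm(y ~ x))
print(rbind(gradient_descent = c(b, w), least_squares = unname(ols)))
```

Both rows should agree closely, since the two procedures minimize the same squared-error objective over the same linear model.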